vttablet: handle applier metadata init failures in relay-log recovery#19560
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19560 +/- ##
===========================================
- Coverage 69.67% 59.96% -9.72%
===========================================
Files 1614 109 -1505
Lines 216793 17853 -198940
===========================================
- Hits 151044 10705 -140339
+ Misses 65749 7148 -58601
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
`handleRelayLogError` currently retries replication restart for known recoverable metadata-init failures (relay log info and master info). MySQL can also return: ``` Replica failed to initialize applier metadata structure from the repository ``` This treats this error as the same recoverable class by triggering `RestartReplication` (STOP REPLICA, RESET REPLICA, START REPLICA). Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
c82ddf5 to
721e63e
Compare
| // The same fix also works for https://github.com/vitessio/vitess/issues/10955. | ||
| if strings.Contains(err.Error(), "Replica failed to initialize relay log info structure from the repository") || | ||
| strings.Contains(err.Error(), "Could not initialize master info structure") { | ||
| if isRecoverableReplicationInitializationError(err) { |
There was a problem hiding this comment.
I guess we don't have access to the MySQL error codes here? It feels quite brittle to check against the error message strings - but if that's the only thing we can do here it's fine.
There was a problem hiding this comment.
Yeah I was thinking the same thing, but it gets flattened earlier: https://github.com/vitessio/vitess/blob/main/go/vt/mysqlctl/query.go?plain=1#L84. It's definitely doable and preferable, but I think it'd require a bit of a refactor that I'll leave as a follow-up.
There was a problem hiding this comment.
+1 that sqlerror.NewSQLErrorFromError sucks, but it's in other code 🤷
I just wanted to point out many RPCs map sqlerrors -> vterrors.Code. For example, we return vtrpcpb.Code_UNAVAILABLE if the error is of a class that probably means unavailable
mattlord
left a comment
There was a problem hiding this comment.
The idea is good, but a number of suggestions. Also some testing gaps identified by Claude:
Testing Gaps
4. Unit test doesn't verify substring matching with MySQL error prefixes
TestHandleRelayLogError creates errors via errors.New(constantString), testing exact matches. In production, MySQL errors arrive wrapped with
errno/sqlstate:
ERROR 1872 (HY000): Replica failed to initialize relay log info structure from the repository
ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
The strings.Contains logic handles this correctly, but the unit test doesn't prove it. Adding one test case with a realistic MySQL-wrapped error
message would catch any future regression if someone changed Contains to e.g. HasPrefix or ==:
{
name: "applier metadata error with MySQL prefix",
inputErr: errors.New("ERROR 1872 (HY000): Replica failed to initialize applier metadata structure from the repository"),
shouldRestart: true,
},
5. Planned reparent integration tests don't cover masterInfoInitializationError
The relayErrors table in TestPlannedReparentShardRelayLogError and TestPlannedReparentShardRelayLogErrorStartReplication covers "relay log info"
and "applier metadata" but omits "master info". If you're converting to table-driven, including all three error variants is cheap and gives
complete coverage. The reparent_utils_test.go does cover "master info", so cross-file coverage exists, but within a single test function the gap is
inconsistent.
6. TestHandleRelayLogError "unrelated error" case doesn't verify the error is returned unmodified
The test checks require.ErrorIs(t, err, tc.inputErr) which is correct, but it doesn't verify that RestartReplication was NOT called (no
CheckSuperQueryList verification for the non-restart path). The empty ExpectedExecuteSuperQueryList handles this implicitly since
CheckSuperQueryList would fail if unexpected queries ran, but only if the test actually calls CheckSuperQueryList — which it does at the bottom. So
this is fine, just subtle.
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
…path The reparent branch that restarts replication after a no-op source check still called `StartReplication` inline and routed the error through `handleRecoverableReplicationInitError` manually. That duplicated the helper we already use everywhere else for the same start recovery behavior. This switches that branch to call `startReplicationRecoverable` directly so the explicit START REPLICA recovery stays in one place. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
The reparent restart branch still open-coded a recoverable `STOP REPLICA` path while start recovery had already been pulled into a helper. That left the stop handling inconsistent and made the restart path harder to read. This adds `stopReplicationRecoverable`, uses it where the reparent restart flow already had identical recoverable-stop behavior, and adds a focused unit test for the new helper. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4762624b01
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
`SetReplicationSource` source-change failures were still handled with restart-style recovery, which can resume the old source before the requested one is known to be stored. The same recovery shape was also awkward for `STOP REPLICA`, where a recoverable stop failure could return success with replication running. This changes `setReplicationSourceRecoverable` to repair recoverable source-change errors by `ResetReplicationParameters`, reapply the requested source, and only start replication when requested. It also stops attempting recoverable handling for `STOP REPLICA`, updates the helper comments, and adds regression coverage for running and non-running replicas. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
|
Made some changes worth noting:
|
MySQL requires the replica SQL and I/O threads to be stopped before `RESET REPLICA [ALL]`, but the new `SetReplicationSource` recovery path could reach `ResetReplicationParameters` immediately after a failed source-change attempt on a running replica. That left the recovery logic depending on an unstated assumption about the failed `SetReplicationSource(..., stopReplicationBefore=true)` call. This now issues an explicit `StopReplication` before `ResetReplicationParameters` when the replica was running, and updates the running-replica recovery test to expect that extra stop. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
The replication errno alias block in `go/mysql/sqlerror/constants.go` kept the old aligned spacing on `ERInnodbReadOnly`, which fails `gofumpt`. This reapplies the formatter output so CI accepts the file. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f8320ad1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err := tm.MysqlDaemon.StopReplication(ctx, tm.hookExtraEnv()); err != nil { | ||
| if err := tm.handleRelayLogError(ctx, err); err != nil { | ||
| return err | ||
| } | ||
| return err | ||
| } |
There was a problem hiding this comment.
Recover metadata-init errors when STOP REPLICA fails
In the status.SourceHost == host && status.SourcePort == port && shouldbeReplicating path, StopReplication errors are now returned directly, so recoverable metadata-init failures (1201/1871/1872) from STOP REPLICA no longer trigger the reset-and-restart self-heal. This branch is exercised when the source is already correct (e.g. no-op/planned reparent and force-start flows), so a transient/corrupt replication metadata condition that previously recovered can now abort reparent/replication setup instead of repairing itself.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is intentional, see #19560 (comment)
`TestPlannedReparentShardRelayLogError` still expected `PlannedReparentShard` to succeed when `STOP REPLICA` returned a recoverable metadata-init error, but the tabletmanager change intentionally removed stop recovery. That made the wrangler test fail in CI even though the supported `START REPLICA` recovery path still works. This changes the stop-error PRS test to expect `SetReplicationSource` to fail and keeps the start-error companion test as the success-path coverage for the recovery we still support. Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37ce03c297
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return nil | ||
| } | ||
|
|
||
| return tm.startReplicationRecoverable(ctx) |
There was a problem hiding this comment.
Avoid running postflight hook in SetReplicationSource path
When shouldStartReplication is true, this helper now starts replication via startReplicationRecoverable, which calls MysqlDaemon.StartReplication and therefore runs the postflight_start_slave hook. Previously, SetReplicationSource(..., startReplicationAfter=true) performed CHANGE REPLICATION SOURCE + START REPLICA without invoking that hook, so this change can make reparent/init flows fail in environments where the postflight hook is present and returns an error, even though source reconfiguration and SQL start succeeded. Please preserve the previous hook behavior for source-change starts (or make hook execution explicit/opt-in for this path).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
This is certainly a behavior change, but I'm not sure it's a wrong change. I'd expect the hook to run in this case, but it may cause some unexpected behavior.
| // The same fix also works for https://github.com/vitessio/vitess/issues/10955. | ||
| if strings.Contains(err.Error(), "Replica failed to initialize relay log info structure from the repository") || | ||
| strings.Contains(err.Error(), "Could not initialize master info structure") { | ||
| if isRecoverableReplicationInitializationError(err) { |
There was a problem hiding this comment.
+1 that sqlerror.NewSQLErrorFromError sucks, but it's in other code 🤷
I just wanted to point out many RPCs map sqlerrors -> vterrors.Code. For example, we return vtrpcpb.Code_UNAVAILABLE if the error is of a class that probably means unavailable
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
…vitessio#19560) Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
…vitessio#19560) Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>
…vitessio#19560) Signed-off-by: Mohamed Hamza <mhamza@fastmail.com> Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Description
handleRelayLogErrorcurrently retries replication restart for known recoverable metadata-init failures (relay log info and master info). MySQL can also return:This treats this error as the same recoverable class by triggering
RestartReplication(STOP REPLICA, RESET REPLICA, START REPLICA).Related Issue(s)
Fixes #19612
Checklist
Deployment Notes
AI Disclosure
Help from Codex.